Mining for Outliers in Sequential Databases

Authors

  • Pei Sun
  • Sanjay Chawla
  • Bavani Arunasalam
Abstract

The mining of outliers (or anomaly detection) in large databases remains an active area of research with many potential applications. Over the last several years, many novel methods have been proposed to mine for outliers efficiently and accurately. In this paper we propose an approach to mining sequential outliers using Probabilistic Suffix Trees (PST). The key insight that underpins our work is that we can distinguish outliers from non-outliers by examining only the nodes close to the root of the PST. Thus, if the goal is just to mine outliers, we can drastically reduce the size of the PST and, with it, its construction and query time. In our experiments on a real data set of protein sequences, retaining less than 5% of the original PST is enough to retrieve all the outliers reported by the full-sized PST. We also carry out a detailed comparison between two measures of sequence similarity, the normalized probability and the odds, and show that while the current research literature on PSTs favours the odds, for outlier detection it is the normalized probability that gives far superior results. We provide an information-theoretic argument based on entropy to explain the success of the normalized probability measure. Finally, we describe a more efficient implementation of the PST algorithm, which dramatically reduces its construction time compared to the implementation of Bejerano [3].
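To make the idea of scoring sequences with a shallow context model concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "normalized probability" can be read as the average per-symbol log-probability of a sequence, and it replaces a real PST with plain context counts up to a fixed depth, with add-one smoothing. The function names, the `max_depth` parameter, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def train_counts(sequences, max_depth):
    """Count next-symbol occurrences for every context of length 0..max_depth.

    This is a simplified stand-in for a PST truncated near the root:
    no probability thresholds or pruning rules are applied.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for d in range(0, min(max_depth, i) + 1):
                counts[seq[i - d:i]][symbol] += 1
    return counts


def normalized_log_prob(seq, counts, alphabet, max_depth):
    """Average per-symbol log-probability of a non-empty sequence.

    Each symbol is predicted from the longest preceding context
    (at most max_depth symbols) that was seen during training.
    """
    total = 0.0
    for i, symbol in enumerate(seq):
        d = min(max_depth, i)
        # Back off to shorter contexts until one was observed in training.
        while d > 0 and seq[i - d:i] not in counts:
            d -= 1
        ctx_counts = counts.get(seq[i - d:i], {})
        n = sum(ctx_counts.values())
        # Add-one smoothing (an assumption) so unseen symbols keep nonzero mass.
        p = (ctx_counts.get(symbol, 0) + 1) / (n + len(alphabet))
        total += math.log(p)
    return total / len(seq)


# Toy data (hypothetical): training sequences alternate A and B.
training = ["ABABABAB", "ABABABABAB", "BABABA", "ABABAB"]
alphabet = "AB"
counts = train_counts(training, max_depth=2)

typical = normalized_log_prob("ABABAB", counts, alphabet, max_depth=2)
unusual = normalized_log_prob("AABBBA", counts, alphabet, max_depth=2)
```

Under these assumptions, the sequence that breaks the alternating pattern receives a lower score than the one that follows it, so ranking sequences by ascending score would surface outlier candidates first. Dividing by the length keeps long and short sequences comparable.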

Similar resources

Outlier Detection in High Dimensional, Spatial and Sequential Data Sets

Of all the data mining techniques, outlier detection seems closest to the definition of “discovering nuggets of information” in large databases. When an outlier is detected, and determined to be genuine, it can provide insights, which can radically change our understanding of the underlying process. The purpose of the research underlying this thesis was to investigate and devise methods to mine...

Exploring multi-dimensional sequential patterns across multi-dimensional multi-sequence databases

Existing multi-dimensional sequential pattern mining methods discover multi-dimensional sequential patterns only in databases involving one sequential dimension. Since such patterns may also exist in databases containing more than one sequential dimension, in this paper we present the algorithm PSeq-MIDim for mining multi-dimensional sequential patterns from multiple sequential d...

Mining Outliers in Spatial Networks

Outlier analysis is an important task in data mining and has attracted much attention in both research and applications. Previous work on outlier detection involves different types of databases such as spatial databases, time series databases, biomedical databases, etc. However, few of the existing studies have considered spatial networks where points reside on every edge. In this paper, we stu...

Robust Decision Trees: Removing Outliers from Databases

Finding and removing outliers is an important problem in data mining. Errors in large databases can be extremely common, so an important property of a data mining algorithm is robustness with respect to errors in the database. Most sophisticated methods in machine learning address this problem to some extent, but not fully, and can be improved by addressing the problem more directly. In this pa...

Abstract—Mining Sequential Patterns in large databases has become

Mining Sequential Patterns in large databases has become an important data mining task with broad applications. It describes potential sequenced relationships among items in a database. Many different algorithms have been introduced for this task. Conventional algorithms can find the exact optimal Sequential Pattern rule, but this takes a long time,...

Publication date: 2006